Abstract. We present baseline results for the joint segmentation and
classification of dialog acts (DAs) of the ICSI Meeting Corpus. Two simple
approaches based on word information are investigated and compared
with previous work on the same task. We also describe several metrics
to assess the quality of the segmentation alone as well as the joint
performance of segmentation and classification of DAs.


1  Introduction 

As spoken language technology research moves toward more complex domains,
further processing of the stream of words provided by a recognizer is often
necessary. To support higher-level tasks such as information retrieval and
summarization [1,2], the input speech signal must be segmented into meaningful
units, for example dialog acts (DAs). Typical DA types are statements, questions,
and backchannels. The task we investigate here is to split a stream of words into
non-overlapping segments of text and to assign mutually exclusive DA types to
these segments. While this task description suggests a sequential solution, an
approach based on joint segmentation and classification most likely performs best.
We use the term joint segmentation and classification for systems that do not
implement this task as two independent modules running in sequence
but produce their final result by taking into account information from both the
segmentation and the classification. In the sequential approach, by contrast, the
segmentation does not take advantage of information coming from the
classification of DAs.

Previous  work  mainly  concentrated  on  either  the  segmentation  of  speech  into 
sentences  [3,4]  or  the  classification  of  already  segmented  text  into  various  sets 
of  DA  types  [5, 6, 7, 8].  For  automatic  segmentation  of  speech  it  remains  unclear 
how  well  a  subsequent  component  can  handle  segmentation  errors.  For  the  latter 
case,  the  classification  of  DAs,  it  is  typically  assumed  that  the  true  segmentation 
boundaries  are  provided.  As  a  consequence,  a  degradation  of  the  performance 
due  to  imperfect  segmentation  boundaries  must  be  expected.  To  provide  more 
realistic  results  for  the  task  of  automatic  segmentation  and  classification  of  DAs, 
a  sequential  approach  is  described  in  [9].  In  this  paper  we  make  a  first  attempt 
toward joint segmentation and classification of DAs on the ICSI (MRDA)
Corpus [10].

Metric    Errors            Reference Units   Error Rate

NIST-SU   3 FA, 1 miss      5 boundaries      80%
DSER      3 match errors    5 DAs             60%


Fig. 1. Two metrics for the assessment of segmentation performance. S, Q, B, and
D represent words of statements, questions, backchannels, and disruptions. DA
boundaries are indicated using the symbol ‘|’, while ‘ ’ is used for nonboundaries.
Errors and correct cases are indicated using letters E and C.


2  Methodology  and  Performance  Metrics 

For the joint segmentation and classification of DAs, two simple techniques are
investigated in this paper. The first technique is based on a Hidden-Event
Language Model (HE-LM) described in [11], and the second relies on a Hidden
Markov Model (HMM) based tagger. The HE-LM is frequently used for detection
of sentence boundaries [9,4], where after each word the model predicts a
nonboundary or a sentence boundary event. In contrast, we use the HE-LM to
predict not only a DA boundary or a nonboundary event, but the type of the DA
boundary at the same time. This extension to [11] was also used in [3] to detect
sentence boundaries and 5 different types of disfluencies. In our case the
DA-specific boundary posterior probabilities are computed using forward-backward
dynamic programming. The model can be seen as an nth-order HMM in which
the word/event pairs correspond to states and the words to observations, with
the transition probabilities given by the n-gram LM.
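
As a rough illustration of the posterior computation, the following Python
sketch runs forward-backward on a first-order HMM whose hidden states are the
events that follow each word (a nonboundary or a typed DA boundary). This is a
deliberate simplification of the nth-order word/event-pair model described
above. The function name, the state labels, and all probabilities are
hypothetical values chosen for illustration and are not taken from [11] or from
the systems evaluated here.

```python
def forward_backward(init, trans, emit, observations):
    """Return P(state at position t | all observations) for a first-order HMM.

    init[s], trans[p][s] and emit[s][word] are plain nested dicts.
    """
    states = list(init)
    n = len(observations)
    fwd = [dict() for _ in range(n)]
    bwd = [dict() for _ in range(n)]

    # Forward pass: joint probability of the prefix and the current state.
    for s in states:
        fwd[0][s] = init[s] * emit[s][observations[0]]
    for t in range(1, n):
        for s in states:
            incoming = sum(fwd[t - 1][p] * trans[p][s] for p in states)
            fwd[t][s] = emit[s][observations[t]] * incoming

    # Backward pass: probability of the suffix given the current state.
    for s in states:
        bwd[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(
                trans[s][q] * emit[q][observations[t + 1]] * bwd[t + 1][q]
                for q in states
            )

    # Combine and normalize to obtain per-position posteriors.
    posteriors = []
    for t in range(n):
        total = sum(fwd[t][s] * bwd[t][s] for s in states)
        posteriors.append({s: fwd[t][s] * bwd[t][s] / total for s in states})
    return posteriors


# Toy model (hypothetical numbers): "none" = nonboundary,
# "S|" / "Q|" = statement / question boundary after the word.
STATES = ["none", "S|", "Q|"]
init = {"none": 0.7, "S|": 0.2, "Q|": 0.1}
trans = {s: {"none": 0.7, "S|": 0.2, "Q|": 0.1} for s in STATES}
emit = {
    "none": {"so": 0.4, "what": 0.3, "now": 0.3},
    "S|": {"so": 0.2, "what": 0.2, "now": 0.6},
    "Q|": {"so": 0.1, "what": 0.3, "now": 0.6},
}
posteriors = forward_backward(init, trans, emit, ["so", "what", "now"])
```

Each entry of `posteriors` is a distribution over events for one word position.
A boundary of a given DA type could then be hypothesized wherever its posterior
exceeds a threshold, with the threshold tuned on development data.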

The second technique relies on the concept of disambiguation of words, which
is widely used in the form of HMM based Part of Speech (POS) taggers. In
our case conventional n-gram LMs are used to model the priors of sequences
$((w_1, d_1), (w_2, d_2), \ldots, (w_n, d_n))$. The $w_i$ are the words from the
lexicon provided by the Speech to Text (STT) system and the $d_i$ represent
specific DAs such as statements, questions, etc. To model segmentation boundaries
between words of the same DA type, the lexicon of the DA types also includes
special symbols indicating the first word of a new DA (e.g. the symbol S+ tags the
first word of a statement, while the other words of a statement are tagged with
an S). Mapping probabilities $p(w \mid (w, d))$ are then estimated from the training
corpus that is used to train the n-gram LM. Simple add-1 smoothing is applied to
account for unseen word-DA combinations. Finally, the sequence
$((w_1, d_1), (w_2, d_2), \ldots)$ with the highest posterior probability is computed
for a provided input sequence $(w_1, w_2, \ldots)$.
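
To make the tag inventory concrete, the following Python sketch converts
between DA segments and per-word tags of the kind described above, where a
trailing "+" marks the first word of a new DA. The function names and the
(type, length) segment representation are assumptions made for illustration.
Treating a change of DA type without a "+" tag as a new segment is likewise a
choice made only for this sketch.

```python
def segments_to_tags(segments):
    """Expand (da_type, length) segments into one tag per word.

    The first word of each DA gets the type followed by "+",
    all remaining words get the bare type. Lengths must be >= 1.
    """
    tags = []
    for da_type, length in segments:
        tags.append(da_type + "+")
        tags.extend([da_type] * (length - 1))
    return tags


def tags_to_segments(tags):
    """Collapse a per-word tag sequence back into (da_type, length) segments."""
    segments = []
    for tag in tags:
        da_type = tag.rstrip("+")
        starts_new = (
            tag.endswith("+")
            or not segments
            or segments[-1][0] != da_type
        )
        if starts_new:
            segments.append([da_type, 1])
        else:
            segments[-1][1] += 1
    return [(da_type, length) for da_type, length in segments]


reference = [("S", 3), ("S", 2), ("Q", 1)]
tags = segments_to_tags(reference)
# tags is ["S+", "S", "S", "S+", "S", "Q+"]
```

The "+" symbols are what allow two adjacent statements to be kept apart: without
them, the tag sequence for the example would collapse into a single statement of
five words followed by a question.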

Metric    Errors                   Reference      Error Rate

NIST-SU   1 sub., 3 FAs, 1 miss    5 boundaries   100%
Lenient   5 match errors           11 words       45%
Strict    10 match errors          11 words       91%
DER       4 match errors           5 DAs          80%


Fig. 2. Comparison of metrics to measure joint performance of segmentation
and classification of DAs.


To assess the performance of joint segmentation and classification of DAs,
a number of measures have been proposed. We first describe two metrics for
the measurement of the segmentation performance before metrics for the joint
segmentation  and  classification  of  DAs  are  explained.  The  NIST-SU  metric  was 
used  to  report  the  segmentation  performance  in  previous  work  [9]  and  has  been 
provided  by  NIST  in  the  EARS  MDE  evaluations  [12].  As  this  measure  takes  into 
account  only  the  local  correspondence  of  reference  boundaries  and  boundaries 
computed  by  the  system,  a  direct  interpretation  of  the  resulting  error  rates  is  not 
always  easy.  To  provide  a  more  intuitive  metric  we  propose  the  DA  Segmentation 
Error  Rate  (DSER),  which  measures  the  percentage  of  wrongly  segmented  DA 
segments.  A  DA  is  considered  to  be  mis-segmented  if  and  only  if  its  left  and/or 
right  boundary  does  not  correspond  to  the  reference  segmentation  exactly.  This 
also  implies  that  missed  and  false  alarm  (FA)  cases  are  penalized  differently 
under  the  NIST-SU  and  the  DSER  metrics.  See  Fig.  1  for  an  illustration. 
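
The two segmentation metrics can be sketched in a few lines of Python. The code
below assumes that reference and hypothesis cover the same word sequence and
that each segmentation is given as a list of DA lengths in words. The function
names and the example are hypothetical. For the DSER, the sketch counts a
reference DA as correct only if a hypothesis DA spans exactly the same words,
which is one reading of the definition above.

```python
from itertools import accumulate


def spans(lengths):
    """Turn DA lengths into (start, end) word-index spans, end exclusive."""
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def nist_su(ref_lengths, hyp_lengths):
    """(false alarms + misses) divided by the number of reference boundaries."""
    ref_b = {end for _, end in spans(ref_lengths)}
    hyp_b = {end for _, end in spans(hyp_lengths)}
    misses = len(ref_b - hyp_b)
    false_alarms = len(hyp_b - ref_b)
    return (misses + false_alarms) / len(ref_b)


def dser(ref_lengths, hyp_lengths):
    """Fraction of reference DAs without an identical span in the hypothesis."""
    hyp_spans = set(spans(hyp_lengths))
    ref_spans = spans(ref_lengths)
    wrong = sum(span not in hyp_spans for span in ref_spans)
    return wrong / len(ref_spans)


ref = [3, 2, 1, 4]  # 10 words in 4 reference DAs
hyp = [3, 3, 2, 2]  # hypothetical system output over the same 10 words
print(nist_su(ref, hyp))  # 0.5  (1 miss + 1 false alarm, 4 boundaries)
print(dser(ref, hyp))     # 0.75 (3 of 4 reference DAs mis-segmented)
```

In the example a single missed boundary and a single false alarm damage three
reference DAs, which shows why the two metrics penalize the same errors
differently.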

For the assessment of the joint performance of the segmentation and
classification of DAs, four different metrics are used in the experiments described
in Sec. 3. These metrics are illustrated in Fig. 2. First, the NIST-SU error
metric is adapted to also include substitutions, not only missed boundaries or
false alarms. Substitutions occur when the system outputs a DA boundary at the
correct position but the reference and the system disagree on the DA type on the
left side of the boundary. The word-based lenient and strict metrics have been
introduced in [9]. The lenient metric does not take into account the segmentation
boundaries but only compares the DA types assigned to corresponding words.
For the strict metric, a word is considered to be correctly classified if and only
if it has been assigned the correct DA type and it lies in exactly the same DA
segment as the corresponding word of the reference.
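
The following Python sketch illustrates the adapted NIST-SU metric and the
lenient and strict word-based metrics. It assumes that reference and hypothesis
cover the same words and represents each DA as a (type, length) pair. The
function names and the example are hypothetical and are not the scoring code
used in [9].

```python
def expand(segments):
    """Return one (da_type, start, end) triple per word; end is exclusive."""
    words, start = [], 0
    for da_type, length in segments:
        end = start + length
        words.extend([(da_type, start, end)] * length)
        start = end
    return words


def lenient_error(ref, hyp):
    """Fraction of words whose DA type differs, ignoring segment boundaries."""
    r, h = expand(ref), expand(hyp)
    if len(r) != len(h):
        raise ValueError("reference and hypothesis must cover the same words")
    return sum(a[0] != b[0] for a, b in zip(r, h)) / len(r)


def strict_error(ref, hyp):
    """Fraction of words whose DA type or enclosing DA span differs."""
    r, h = expand(ref), expand(hyp)
    if len(r) != len(h):
        raise ValueError("reference and hypothesis must cover the same words")
    return sum(a != b for a, b in zip(r, h)) / len(r)


def nist_su_joint(ref, hyp):
    """(substitutions + false alarms + misses) / reference boundaries."""
    # Map each boundary position to the DA type on its left side.
    ref_b = {end: da_type for da_type, _, end in expand(ref)}
    hyp_b = {end: da_type for da_type, _, end in expand(hyp)}
    misses = len(ref_b.keys() - hyp_b.keys())
    false_alarms = len(hyp_b.keys() - ref_b.keys())
    subs = sum(ref_b[p] != hyp_b[p] for p in ref_b.keys() & hyp_b.keys())
    return (subs + false_alarms + misses) / len(ref_b)


ref = [("S", 3), ("Q", 2), ("B", 1), ("S", 4)]
hyp = [("S", 3), ("S", 3), ("S", 2), ("S", 2)]  # hypothetical system output
print(lenient_error(ref, hyp))  # 0.3  (3 of 10 words carry the wrong type)
print(strict_error(ref, hyp))   # 0.7  (7 of 10 words wrong type or wrong span)
print(nist_su_joint(ref, hyp))  # 0.75 (1 sub. + 1 FA + 1 miss, 4 boundaries)
```

The last four words of the example carry the correct type but lie in segments
that do not match the reference, so they count as errors only under the strict
metric.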

As a semantically easy-to-interpret metric for the joint segmentation and
classification of DAs we propose the DA Error Rate (DER). This metric is derived
from the DSER and requires a DA not only to have exactly matching boundaries
but also to be tagged with the correct DA type. The DER thus measures the
percentage of misrecognized DAs and can be seen as a length-normalized
version of the strict metric.
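
A minimal Python sketch of the DER, under the same assumptions as before
(identical word sequences, DAs given as hypothetical (type, length) pairs):
a reference DA counts as correct only if the hypothesis contains a DA with the
same type and exactly the same word span.

```python
def da_spans(segments):
    """Turn (da_type, length) segments into (da_type, start, end) triples."""
    result, start = [], 0
    for da_type, length in segments:
        result.append((da_type, start, start + length))
        start += length
    return result


def der(ref, hyp):
    """Fraction of reference DAs without an identical typed span in hyp."""
    hyp_spans = set(da_spans(hyp))
    ref_spans = da_spans(ref)
    wrong = sum(span not in hyp_spans for span in ref_spans)
    return wrong / len(ref_spans)


ref = [("S", 3), ("Q", 2), ("B", 1), ("S", 4)]
hyp = [("S", 3), ("S", 3), ("S", 2), ("S", 2)]  # hypothetical system output
print(der(ref, hyp))  # 0.75: only the first statement is fully correct

# Perfect segmentation, one wrong type: half of the DAs are misrecognized.
print(der([("S", 3), ("Q", 2)], [("S", 3), ("S", 2)]))  # 0.5
```

Because every DA contributes one unit regardless of its length, a long
misrecognized statement weighs no more than a one-word backchannel, in contrast
to the word-based strict metric.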

For  completeness  we  also  mention  the  recognition  accuracy  described  in  [13] 
that  corresponds  to  the  classical  word  error  rate.  As  in  the  case  of  the  word 
error  rate  the  accuracy  metric  of  [13]  only  relies  on  the  sequence  of  symbols  (DA 
types  in  our  case)  and  does  not  consider  the  actual  segmentation  boundaries. 
Scoring  is  then  based  on  the  string  edit  distance.  This  metric  is  not  used  in  the 
experiments  below. 
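
For illustration only, the string edit distance over DA type sequences can be
sketched as a standard Levenshtein computation in Python. The function name and
the example sequences are hypothetical, and the exact normalization used in [13]
is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of DA type symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference symbol
                cur[j - 1] + 1,          # insertion of a hypothesis symbol
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


ref_types = ["S", "Q", "B", "S"]
hyp_types = ["S", "S", "S", "S"]
print(edit_distance(ref_types, hyp_types))  # 2 substitutions
```

The example shows the limitation noted above: the score depends only on the
order of the DA types, so a hypothesis with badly misplaced boundaries can still
obtain a low error.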


3  Experiments  and  Discussion 

For all experiments reported here the experimental setup is as described
in [9]. Of the 75 available meetings of the ICSI MRDA corpus, two meetings of a
different nature are excluded (Btr001 and Btr002). From the remaining meetings
we use 51 for training, 11 for development, and 11 for evaluation. For the
segmentation and classification of the DA types, the available speech is first sorted
according to the speaker, and then by time. The available DA types are mapped
to the following five distinct types: backchannels (B), disruptions (D), floor
grabbers (F), questions (Q), and statements (S). Each system is then optimized and
evaluated under both reference and STT conditions. Under the reference
condition it is assumed that we have access to the true sequence of the spoken words,
while under the STT condition the recognizer's top-choice sequence of words is
provided.

The sequential approach to segmentation and classification of DAs
described in [9] differs in a number of aspects from the systems investigated in this
paper. Major differences lie in its sequential nature and the usage of prosodic
and word-based information for both segmentation and classification of DAs.
Prosody has been shown to help both the segmentation [4] and the classification
of DAs [7]. While this system has the potential drawback of working in a
sequential fashion, it takes advantage of prosody in the segmentation step and
has access to the complete DA segment for classification. The potential advantage of
the systems described in this paper lies in their ability to produce segmentation
boundaries that are based on the estimation of the previous DA type for the last
n words. However, both the HE-LM and the tagger approach decide to segment
and classify DAs based on local information only. Since the classification of the
DA is implicitly done by predicting a corresponding DA boundary, valuable
information is lost when the beginning of the current DA has fallen out of the
current n-gram context.

Segmentation  performance  results  of  the  different  systems  are  provided  in 
Table  1.  To  better  compare  the  integrated  approaches  with  the  previous  results, 
we  report  the  segmentation  error  rate  for  [9]  using  the  HE-LM  alone  without 
taking  into  account  the  prosodic  pause  feature.  Please  note  that  due  to  a  minor 
difference  in  the  counting  of  errors  under  STT  conditions  the  error  rates  given 
in  Table  1  are  slightly  lower  than  those  previously  reported  in  [9].  Comparing 
the  HE-LM  and  the  tagger  approach  of  this  paper,  we  notice  that  the  HE-LM 
consistently  outperforms  the  tagger  on  both  segmentation  metrics. 


Condition  System    NIST-SU  DSER

Ref        [9]       34.5     40.8
           [9] np¹   46.0     53.0
           HE-LM     46.3     55.3
           Tagger    51.1     61.7

STT        [9]       45.5     49.4
           [9] np¹   59.5     62.0
           HE-LM     59.6     62.4
           Tagger    62.8     66.9

¹ reduced system, no prosody features

Table 1. Comparison of the segmentation error rates of the different systems
under both reference and STT conditions.


Condition  System    NIST-SU  Lenient  Strict  DER

Ref        [9]       52.6     20.0     64.4    54.4
           [9] np¹   62.3     21.0     72.4    64.1
           HE-LM     62.2     23.3     74.3    66.5
           Tagger    69.5     22.6     78.6    72.6

STT        [9]       68.3     25.1     75.4    64.3
           [9] np¹   78.3     25.0     82.9    73.2
           HE-LM     78.0     26.2     83.8    73.9
           Tagger    81.3     22.4     85.4    77.3

¹ reduced system, no prosody features

Table 2. Comparison of the segmentation and classification performance of the
different systems under both reference and STT conditions.

Performance results for the joint segmentation and classification of DAs are
provided in Table 2 for the different systems. Again, performance results for
the reduced version of [9] (not including prosody) are used for better comparison
with the HE-LM and the tagger based methods. Against these results, the HE-LM
approach shows a comparable performance, which is promising given the
simplicity of the approach. As we would expect, the system described in [9] in
its original form outperforms the approaches investigated here. A notable result
from these experiments is the observation that the tagger based approach shows
lower lenient error rates than the HE-LM (and the lowest of all systems under
STT conditions) and, at the same time, the highest error rates for
the NIST-SU, the strict, and the DER metric. This observation suggests that
the lenient metric is most useful when used in combination with other metrics
that take into account the quality of the segmentation as well.

4  Conclusion  and  Outlook 

We have investigated two simple approaches based on word information for joint
segmentation and classification of DAs in multiparty meetings. Furthermore,
with the DSER and the DER we propose additional performance metrics for
segmentation and joint segmentation and classification of DAs with a simple
semantic interpretation. The DSER measures the percentage of wrongly
segmented DAs, while the DER quantifies the percentage of DAs that are wrongly
segmented or wrongly tagged. As the investigated methods do not take into account
prosodic features, it comes as no surprise that the overall performance of these
systems is not always as good as previous work. Based on the experiments, we
suggest that the lenient metric proposed in [9] should not be used alone but in
combination with other metrics that take into account the quality of the
segmentation as well.

The  results  provided  in  this  paper  serve  as  a  baseline  against  which  we  will 
measure  the  results  of  future  work  on  joint  segmentation  and  classification.  As  a 
next  step  we  will  investigate  approaches  that  do  not  rely  on  local  evidence  only 
but  are  able  to  take  into  account  complete  DA  hypotheses  along  the  lines  of  [13]. 
In  such  a  framework  it  is  also  possible  to  integrate  prosodic  information  and  to 
consider  word  lattices. 